UDA-GIST: An In-database Framework to Unify Data-Parallel and State-Parallel Analytics
Authors
Abstract
Enterprise applications need sophisticated in-database analytics in addition to the traditional online analytical processing that a database provides. To meet customers’ pressing demands, database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes offer User-Defined Aggregates (UDAs), a data-driven operator, to implement many of these analytical techniques in parallel. However, UDAs cannot be used to implement statistical algorithms such as Markov chain Monte Carlo (MCMC), where most of the work is performed by iterative transitions over a large state that cannot be naively partitioned because of data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and demands post-processing after the statistical inference. This paper presents General Iterative State Transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA-GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile applications: cross-document coreference and image denoising. We show that the in-database framework allows us to tackle a problem 27 times larger than that solved by the state of the art for the first application and achieves a 43-fold speedup over the state of the art for the second application.
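The three-stage flow described in the abstract (a UDA builds the state, GIST iterates transitions until convergence, a final UDA extracts the result) can be illustrated with a minimal single-process sketch. Everything named here is hypothetical and not taken from the paper: `BuildStateUDA`, `transition`, `gist`, and `run_uda_gist` are illustrative names, the "state" is a toy one-dimensional binary signal, and the transition is a deterministic majority-vote smoothing step chosen for reproducibility rather than the MCMC transitions the paper targets. The sketch also runs sequentially; it shows the shape of the interface only, not the parallel execution.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, int]  # (position, binary value); a hypothetical input tuple format


class BuildStateUDA:
    """Hypothetical UDA that assembles the shared state from input tuples."""

    def __init__(self) -> None:
        self.cells: Dict[int, int] = {}

    def accumulate(self, row: Row) -> None:
        position, value = row
        self.cells[position] = value

    def merge(self, other: "BuildStateUDA") -> None:
        # Combine partial states built over different data partitions.
        self.cells.update(other.cells)

    def finalize(self) -> List[int]:
        return [self.cells[i] for i in sorted(self.cells)]


def transition(state: Sequence[int]) -> Tuple[List[int], bool]:
    """One round: each interior cell takes the majority of itself and its neighbours.

    Reads only the old state, so every cell could be updated independently.
    """
    new_state = list(state)
    changed = False
    for i in range(1, len(state) - 1):
        votes = state[i - 1] + state[i] + state[i + 1]
        value = 1 if votes >= 2 else 0
        if value != state[i]:
            changed = True
        new_state[i] = value
    return new_state, changed


def gist(state: Sequence[int], max_rounds: int = 100) -> Tuple[List[int], int]:
    """Repeat transitions until a round changes nothing (or max_rounds is hit)."""
    current = list(state)
    for rounds in range(1, max_rounds + 1):
        current, changed = transition(current)
        if not changed:
            return current, rounds
    return current, max_rounds


def run_uda_gist(partitions: Sequence[Sequence[Row]]) -> Tuple[List[int], int, int]:
    # Stage 1: data-parallel state construction (one UDA instance per partition).
    partials = []
    for partition in partitions:
        uda = BuildStateUDA()
        for row in partition:
            uda.accumulate(row)
        partials.append(uda)
    merged = BuildStateUDA()
    for partial in partials:
        merged.merge(partial)
    state = merged.finalize()

    # Stage 2: iterative state transitions until convergence.
    final_state, rounds = gist(state)

    # Stage 3: post-processing; here a trivial aggregate counting the 1-cells.
    return final_state, sum(final_state), rounds


if __name__ == "__main__":
    noisy = [[(0, 0), (1, 0), (2, 1), (3, 0)], [(4, 0), (5, 1), (6, 1), (7, 1)]]
    print(run_uda_gist(noisy))
```

The `accumulate`/`merge`/`finalize` split mirrors the usual shape of a user-defined aggregate, which is what lets the first stage run over partitions independently; the loop in `gist` is the part that a UDA alone does not express, because each round depends on the whole state produced by the previous one.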
Similar resources
Parallel computation framework for optimizing trailer routes in bulk transportation
We consider a rich tanker trailer routing problem with stochastic transit times for chemicals and liquid bulk orders. A typical route of the tanker trailer comprises sourcing a cleaned and prepped trailer from a pre-wash location, pickup and delivery of chemical orders, cleaning the tanker trailer at a post-wash location after order delivery and prepping for the next order. Unlike traditiona...
Comparing Parallel Simulated Annealing, Parallel Vibrating Damp Optimization and Genetic Algorithm for Joint Redundancy-Availability Problems in a Series-Parallel System with Multi-State Components
In this paper, we study different methods of solving joint redundancy-availability optimization for series-parallel systems with multi-state components. We analyze the factors that affect system availability in order to determine the optimum number and version of components in each sub-system, and consider the effects of improving failure rates of each component in each sub-system and impr...
Parallel Time Series Modeling - A Case Study of In-Database Big Data Analytics
MADlib is an open-source library for scalable in-database analytics. In this paper, we present our parallel design of time series analysis and implementation of ARIMA modeling in MADlib’s framework. The algorithms for fitting time series models are intrinsically sequential since any calculation for a specific time t depends on the result from the previous time step t − 1. Our solution paralleli...
Load Sharing Control of Parallel Inverters with Uncertainty in the Output Filter Impedances for Islanding Operation of AC Micro-Grid
Parallel connection of inverter modules is a way to increase the reliability, efficiency and redundancy of inverters in a Micro-Grid system. Proper load sharing among parallel inverters is a key point. The circulating current among the inverters can greatly reduce the efficiency or even cause instability of the system. In this paper, a control strategy for improving the load sharing performance ...
An Intelligent Computer Interface Utilizing Parallel Picocontrollers (TECHNICAL NOTE)
The design of an interface unit is described, in which RS232 serial data is converted to latched parallel data on 22 independent lines. The data direction of each line is programmable through the serial port. Two picocontrollers are employed in a parallel processing mode to give the required number of I/O pins, and data on the shared serial line is coded to separate data streams to the individu...
Journal title:
- PVLDB
Volume 8, Issue
Pages -
Publication date 2015